Remove cpu limit for rayservice e2e test by AndySung320 · Pull Request #4859 · ray-project/kuberay

AndySung320 · 2026-05-22T06:46:36Z

Why are these changes needed?

Remove CPU resource limits from RayService e2e test specs.

Previously, a 500m CPU limit on the head pod caused dashboard startup timeouts and flaky tests (fixed in #4702 by raising the limit to 1).
However, CPU limits are unnecessary in this test environment and can still cause throttling under load. Removing them entirely eliminates this type of flakiness rather than tuning the limit value.

Related issue number

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

Signed-off-by: AndySung320 <andysung0320@gmail.com>

andrewsykim

Overall this makes sense to me. I would not be surprised if CPU limits contribute to some of the flakiness in e2e tests.

andrewsykim · 2026-05-22T14:37:04Z

              price: 2
            ray_actor_options:
              num_cpus: 0.1
+          - name: PearStand


Why add PearStand?

According to Ray’s official core spec, actors default to num_cpus=1 for scheduling if not explicitly specified.
Because PearStand was defined in the graph but omitted in our serveConfigV2, it didn't get any custom ray_actor_options, so Ray automatically assigned it the default 1 CPU token.
Previously, this was masked because our head node had limits.cpu: 2, which made KubeRay pass --num-cpus=2 to Ray. Now that the limit is removed, KubeRay falls back to requests.cpu: 1. With only 1 CPU token available in Ray in total, PearStand's default 1-CPU demand exceeded the budget and caused the scheduling failure.
Adding PearStand here with num_cpus=0.1 explicitly overrides Ray's 1-CPU default and aligns it with other deployments.
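The budget arithmetic described above can be sketched as follows. This is only an illustration of the reasoning in this comment, not KubeRay's or Ray's actual code: the function names are hypothetical, values are in milli-CPUs to avoid floating-point rounding, and the 600m "in use" figure is an assumption chosen so that 0.4 CPU remains free when the total is 1 CPU.

```python
from typing import Optional


def effective_num_cpus_milli(requests_milli: int, limits_milli: Optional[int] = None) -> int:
    """CPU count advertised to Ray, in milli-CPUs.

    Models the behavior described in the comment: use limits.cpu when it is
    set, otherwise fall back to requests.cpu. Hypothetical helper.
    """
    return limits_milli if limits_milli is not None else requests_milli


def can_schedule(total_milli: int, in_use_milli: int, demand_milli: int) -> bool:
    """True if one more actor with the given CPU demand fits in the budget."""
    return in_use_milli + demand_milli <= total_milli


RAY_DEFAULT_ACTOR_MILLI = 1000  # num_cpus=1 default when nothing is specified
EXPLICIT_ACTOR_MILLI = 100      # num_cpus: 0.1, as set for the other deployments
IN_USE_MILLI = 600              # assumption: leaves 0.4 CPU free out of 1 CPU

with_limit = effective_num_cpus_milli(requests_milli=1000, limits_milli=2000)
without_limit = effective_num_cpus_milli(requests_milli=1000)

# With the old 2-CPU limit, the default 1-CPU demand still fits.
print(can_schedule(with_limit, IN_USE_MILLI, RAY_DEFAULT_ACTOR_MILLI))     # True
# Without the limit, only 1 CPU is advertised and the default demand does not fit.
print(can_schedule(without_limit, IN_USE_MILLI, RAY_DEFAULT_ACTOR_MILLI))  # False
# An explicit num_cpus of 0.1 fits within the remaining 0.4 CPU.
print(can_schedule(without_limit, IN_USE_MILLI, EXPLICIT_ACTOR_MILLI))     # True
```

Under these assumptions, the failure appears only once the limit is gone, which is consistent with the issue having been masked before.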

The controller log showed PearStand failing to schedule with only 0.4 CPU available.

Signed-off-by: AndySung320 <andysung0320@gmail.com>

AndySung320 · 2026-05-22T23:00:20Z

The worker CPU limit in rayservice.autoscaling.yaml is intentionally kept.
The autoscaling test verifies that worker pods scale up under load. Without a CPU limit, Ray may detect enough CPUs on a single worker to fit all replicas, preventing additional worker pods from being created and causing the test to fail.
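A rough sketch of why the worker limit matters for the scale-up check. The helper name, the replica count, the per-replica CPU demand, and the 8-CPU figure are all assumptions for illustration; 500m is the worker limit quoted in the Bugbot comment on this PR. Values are in milli-CPUs.

```python
def workers_needed(replicas: int, replica_milli: int, worker_milli: int) -> int:
    """Worker pods needed to place all replicas, assuming a replica cannot be
    split across workers. Hypothetical helper, not part of KubeRay or Ray."""
    replicas_per_worker = worker_milli // replica_milli
    if replicas_per_worker == 0:
        raise ValueError("a single replica does not fit on one worker")
    # Ceiling division without floating point.
    return -(-replicas // replicas_per_worker)


REPLICAS = 4          # assumption: replicas requested under load
REPLICA_MILLI = 500   # assumption: CPU demand per replica

# With a 500m limit, each worker holds one replica, so more pods are required.
print(workers_needed(REPLICAS, REPLICA_MILLI, worker_milli=500))   # 4
# If Ray instead sees, say, 8 CPUs on one worker, a single pod fits everything.
print(workers_needed(REPLICAS, REPLICA_MILLI, worker_milli=8000))  # 1
```

In the second case no additional worker pods would be created, so a test that waits for scale-up would fail.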

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5a93c83.

cursor · 2026-05-22T23:01:54Z

                  cpu: "1"
                  memory: 1G
                limits:
-                  cpu: "1"


Worker CPU limit not removed in autoscaling YAML

Medium Severity

The CPU limit (cpu: "500m") on the worker pod in rayservice.autoscaling.yaml was not removed, while CPU limits were consistently removed from both head and worker pods in all other RayService test YAML files (rayservice.static.yaml, rayservice.deletiondelay.yaml, ray-service.ft.yaml). This appears to be an oversight that leaves the autoscaling worker pod susceptible to the same CPU throttling-related flakiness this PR intends to eliminate.

see #4859 (comment)

remove cpu limit for rayservice e2e test

7ff6508

Signed-off-by: AndySung320 <andysung0320@gmail.com>

AndySung320 changed the title ~~remove cpu limit for rayservice e2e test~~ Remove cpu limit for rayservice e2e test May 22, 2026

AndySung320 added 2 commits May 22, 2026 00:33

trigger ci

2b815be

Signed-off-by: AndySung320 <andysung0320@gmail.com>

Add PearStand deployment config with num_cpus

d8248a1

Signed-off-by: AndySung320 <andysung0320@gmail.com>

andrewsykim reviewed May 22, 2026

AndySung320 marked this pull request as ready for review May 22, 2026 20:25

cursor Bot reviewed May 22, 2026

Comment thread ray-operator/test/e2erayservice/testdata/rayservice.deletiondelay.yaml

AndySung320 added 2 commits May 22, 2026 13:37

remove cpu limit in deletiondelay yaml

e795b3b

Signed-off-by: AndySung320 <andysung0320@gmail.com>

add back cpu limit for worker spec

5a93c83

Signed-off-by: AndySung320 <andysung0320@gmail.com>

cursor Bot reviewed May 22, 2026

AndySung320 wants to merge 5 commits into ray-project:master from AndySung320:remove-cpu-limit
